Introduction

This research included data and analysis on incarceration rates in the United States from 1970 to 2018. The variable of interest will be ‘total jail pop,’ which represents the total number of people in prison. Furthermore, ‘year’ and’state’ variables will be included in the dataset for analysis so that a comparison may be made by year and state. In addition, variables that provide rates for populations of various races in jail will be included in the dataset so that race and incarceration may be compared. Finally, the dataset includes the ‘land area’ and ‘urbanicity’ variables to show a link between ‘total jail pop’ and ‘land area’ in terms of whether areas are urban, sub-urban, rural, or mid/small.

Summary statistics

Table 1 summarizes the study’s numeric variables. The highest value of ‘total jail pop’ is 23467.19, while the smallest value is 0. The mean value of total jail population across all states and years is 161.121, with a standard deviation of 615.541, suggesting that observations in the variable are considerably distributed.

When comparing the averages of the jail population rates for different races, it can be seen that African Americans have the highest mean of 4393.668, meaning that African Americans have the greatest proportion of their population in jail of all races in the US. Native Americans are next, followed by Latin Americans, and finally white Americans, who have the lowest percentage of their population in prison.

All variables are severely skewed and have large tails, indicating that a log transformation is necessary to properly view the dataset, according to skewness and kurtosis statistics. A log transformation is a data transformation in which each variable x is replaced with a log (x). This inquiry will focus on the natural log transformation. The symbol ln represents the nature log. When continuous data does not follow the bell curve (normal distribution), it is possible to log convert it to make it as “normal” as possible, improving the statistical analysis findings. To put it another way, the log transformation reduces or removes the original data’s skewness.

# Read data
inc <- read_csv("incarceration_trends.csv")

# Select variables of interest
my_dat <- inc %>% 
  select(state, year, total_jail_pop, land_area, black_jail_pop_rate,
         white_jail_pop_rate, latinx_jail_pop_rate, native_jail_pop_rate,
         urbanicity)

## Summary statistics
summ <- basicStats(my_dat[,-c(1,2,9)])[c(3,4,7,8,13:16),]
t_summ <- as.data.frame(t(round(summ,3)))
names(t_summ) <- c("Min", "Max", "Mean", "Med", "Var", "St.dev", "Skew", "Kurt")

# Create table
knitr::kable(t_summ, caption = "Summary statistics of selected variables")
Summary statistics of selected variables
Min Max Mean Med Var St.dev Skew Kurt
total_jail_pop 0.00 23467.19 161.121 32.03 378890.4 615.541 15.665 377.657
land_area 2.05 145572.46 1125.570 616.81 13073141.6 3615.680 26.500 931.934
black_jail_pop_rate 0.00 2145000.00 4393.668 1138.15 523637334.7 22883.123 31.818 1983.550
white_jail_pop_rate 0.00 73887.49 318.500 209.88 828584.7 910.266 36.845 1974.251
latinx_jail_pop_rate 0.00 351162.79 1166.668 369.03 26871953.8 5183.817 21.304 752.294
native_jail_pop_rate 0.00 335500.00 1195.047 0.00 56915816.9 7544.257 19.707 549.899

Table 2 presents top 10 years with highest average total jail population. It is observed that the year 2008 had the highest number of incarcerations.

# Generate dataframe for average population by year. Select top 10 yrs
dat1 <- my_dat %>% group_by(year) %>% 
  summarise(Avg.Pop = mean(total_jail_pop, na.rm = T)) %>% 
  arrange(desc(Avg.Pop)) %>% head(10)
# Create table
knitr::kable(dat1, 
             caption= "Top 10 years with highest average total jail population")
Top 10 years with highest average total jail population
year Avg.Pop
2008 263.3505
2009 261.5556
2007 260.5793
2006 258.0700
2014 257.7243
2010 256.7282
2013 256.4916
2012 254.8216
2011 253.8052
2017 250.0884

Table 3 presents top 10 states with highest average total jail population. It is observed that the state of DC had the highest number of incarcerations with an average of 2315.2551.

# Generate dataframe for average population by state. Select top 10 states
dat2 <- my_dat %>% group_by(state) %>% 
  summarise(Avg.Pop = mean(total_jail_pop, na.rm = T)) %>% 
  arrange(desc(Avg.Pop)) %>% head(10)
# Generate table
knitr::kable(dat2, 
             caption = "Top 10 states with highest average total jail population")
Top 10 states with highest average total jail population
state Avg.Pop
DC 2315.2551
CA 1066.3031
AZ 564.1401
NJ 562.0384
MA 549.9082
FL 546.2006
NY 422.1652
MD 353.9957
PA 336.8743
LA 282.4048

Trend over time

The graph below shows how the average overall jail population has changed over time based on race. At first, the number of those incarcerated hardly changed. However, following 1984, there was a significant increase in incarceration rates, which peaked in 2008. This is followed by a steady decline in incarceration rates throughout 2018. The white race jail population climbed substantially from 2008 to 2018, while the black, Latino, and other race jail populations fell. In addition, white people are the most prevalent race in prison, followed by Black, Latino, Other, and Native Americans.

# Generate dataframe for average population by year
df1 <- inc %>% 
  dplyr::select(white_jail_pop, black_jail_pop, native_jail_pop, latinx_jail_pop,
                other_race_jail_pop, year) %>% 
  dplyr::rename(White = "white_jail_pop", Black = "black_jail_pop", 
                Native = "native_jail_pop", Latin = "latinx_jail_pop",
                Other = "other_race_jail_pop") %>% 
  reshape2::melt(id = "year") %>% dplyr::group_by(year, variable) %>% 
  summarise(Avg.Pop = mean(value, na.rm = T))
  


# Make plot
ggplot(df1, aes(x = year, y = Avg.Pop, color = variable)) +
  geom_line() + 
  labs(x = "Year", y = "Average population", color = "Race",
       title = "Evolution of average incacerations over the years") + 
        theme(plot.title = element_text(hjust=0.5))

Comparison between variables

The box plot below shows how the total jail population variable relates to the urbanicity variable. It is clear that rural areas have the smallest total jail population, whereas urban areas have the highest total jail population.

# Box plot
ggplot(my_dat, aes(x = urbanicity, y = log(total_jail_pop),
                   fill = urbanicity)) +
  geom_boxplot() +
  labs(x = "Region", y = "Log Total population",
       title = "Relationsip between region and total jail population") +
  theme(plot.title = element_text(hjust=0.5))

Geographical representation

Finally, the shape file of the United States is retrieved from https://urbaninstitute.github.io/urbnmapr/ and loaded into r to create a geographical representation of the variable of interest. The average total jail population per state is then calculated for all years, and the result is combined with the shape file data to create a dataframe for charting. The map below depicts the distribution of the average total jail population in the several states.It is clear that CA has the higher average total jail population, on the contrary, AK has lower average total jail population.

df2 <- my_dat %>% group_by(state) %>% 
  summarise(Avg.Pop = mean(total_jail_pop, na.rm = T)) %>% 
  dplyr::rename(state_abbv = state)


spatial_data <- left_join(df2,
                          get_urbn_map(map = "states", sf = TRUE),
                          by = "state_abbv") %>% 
  st_as_sf()


m <-  ggplot() +
  geom_sf(spatial_data, mapping = aes(fill = log(Avg.Pop))) +
  geom_sf_text(data = get_urbn_labels(map = "states", sf = TRUE), 
               aes(label = state_abbv), size = 2) +
  scale_fill_distiller("Log Average Population", palette="Spectral") +
  labs(title = "Average population by State", x = "Longitude", y = "Latitude") +
  theme(plot.title = element_text(hjust=0.5))
ggplotly(m)